Librarians and Link Rot: A Comparative Analysis with Some Methodological Considerations

Authors

  • David C. Tyler
  • Beth McNeil
Abstract

The longevity of printed guides to resources on the web is a topic of some concern to all librarians. This paper attempts to determine whether guides created by specialist librarians perform better than randomly assembled lists of resources (assembled solely for the purpose of web studies), commercially created guides (‘Best of the web’-type publications), and guides prepared by specialists in library science and other fields. The paper also attempts to determine whether the characteristics of included web resources have an impact on guides’ longevity. Lastly, the paper addresses methodological issues of concern to this and similar studies.

Almost since its arrival as a publicly available resource, web users and bibliographers have found themselves frustrated by the instability of the World Wide Web, by the intermittent availability of sites, pages, and other web objects, and, in many instances, by their outright disappearance. Perhaps as a result of this near-universal frustration, a number of researchers have attempted to determine just how unstable a resource the World Wide Web is and just how useful, in terms of the longevity of their accuracy, are the guides, finding aids, and bibliographies that have sprung up around it. In reviewing this literature, the authors have had a nagging sense that the studies that examine web bibliographies may not have taken advantage of some of the discoveries made in studies of the web as an unstable entity and may have been, to some degree, inadvertently biased by either the size or character of their lists/samples or by their methodologies.
To strengthen our understanding of the usefulness of web bibliographies over time and to discover and discuss some of the limitations of such studies, we examined a large and varied collection of web bibliographies published serially over an eight-year period: the College & Research Libraries News “Internet Resources” columns published from 1994 to 2001 (referred to henceforth as the “C&RL News web bibliographies”).

Review of Literature

As mentioned above, many of the studies of website, page, and/or object availability in the literature may be divided into two camps: those primarily interested in the behavior of the entities that make up the World Wide Web and those primarily interested in how the usefulness of the prepared finding aids and bibliographies that have grown up around the web is affected by that behavior. The former studies, let us call them “web-directed,” tend to examine randomly assembled groups of items so that their conclusions can be extrapolated to apply to the web as a whole. These studies tend to be more “diachronic” in their approach: they usually involve the checking of websites, -pages, and/or -objects regularly and continuously over a period to determine whether and how they persist and/or change over time. The latter sort of studies, let us call them “bibliography-directed,” tend to use lists that were consciously assembled, either directly by the researcher or at one remove, and to be more “synchronic” in their approach: they usually involve checking once or twice, during brief discrete intervals, that the listed websites, -pages, and/or -objects are still available. As one might expect, the web-directed studies show more variety in approach and focus than do the bibliography-directed studies.
For example, the 1997 study by Fred Douglis et al., “Rate of Change and Other Metrics: a Live Study of the World Wide Web,” which is primarily concerned with web caching, attempted to quantify the rate and extent of changes to web resources by collecting two traces at the Internet connections of two large corporate networks over 17 days and 2 days, respectively.1 The larger trace comprised “95,000 records from 465 clients accessing 20,400 distinct servers and referencing 474,000 distinct URLs.”2 In their recurring State of the Web surveys, Terry Sullivan and the site “All Things Web” (ATW) have used ATW’s harvester to gather pages randomly from the web—44 pages in 1997, 213 in 1998, and 200 in 1999, the latest sample year available—and have examined, among other things, changes in average total page size and the “incidence and prevalence of broken hyperlinks.”3 For his study of website and -page mortality rates and of the rates and types of change they experience, Wallace Koehler randomly selected 361 sites and 361 pages in late 1996.4 He then checked them regularly over a 53-week period, the sites being checked at three separate intervals during the period and the pages being checked weekly.5 Koehler also published in the same year a piece on the incorporation of web documents into library collections based on his findings.6 He later published a follow-up article that re-examined the behavior of the 361 web pages of the original study so as to provide some insight into the life cycles and change rates of an aging set of web pages.7 The studies focusing on bibliographies, guides, and/or finding aids are fairly similar in focus and approach. One of the earlier studies, S. Mary P.
Benbow’s “File Not Found,” focused on two of Benbow’s articles, published in 1995 and 1997 in Internet Research: Networking Applications and Policy, that contained 74 and 69 URLs, respectively, and were checked for accuracy in late 1997 or early 1998.8 Two later articles, Joel Kitchens and Pixey Anne Mosley’s “Error 404: Or, What Is The Shelf-Life Of Printed Internet Guides?” and Mark Taylor and Diane Hudson’s “‘Linkrot’ And The Usefulness Of Web Site Bibliographies,” examined the accuracy of URLs published in a variety of bibliographies.9 Kitchens and Mosley reviewed samples from several “Best of the web” books (sample size: 3,941 URLs).10 Taylor and Hudson reviewed the URLs in the C&RL News web bibliographies published between October of 1997 and October of 1998 immediately after the publication of the last article (sample size: 482 out of 510 URLs), with a follow-up review of the active links performed six months later.11 Thomas O’Daniel and Chew Kok Wai studied 3,236 sites submitted by their students for a course in electronic commerce.12 They not only checked link failure at two intervals separated by six months but also analyzed for correlation between domain names and the regional allocation of IP addresses.13 Most recently, two researchers at the University of Nebraska, John Markwell and David W. Brooks, investigated on a monthly basis, from August 2000 to May 2002, the incidence of “link rot” in the 515 hyperlinks contained in the online materials of three graduate-level biochemistry courses created in August of 2000.14 They have also reported their results on-line and have been the subject of an article by Vincent Kiernan that appeared in The Chronicle of Higher Education’s on-line edition.15 In any attempt at categorization, there are always exceptions to the rule.
Although they do not quite fit our model and are concerned with particular collections rather than with web bibliographies or with the web proper, we would also like to draw the readers’ attention to Michael Nelson and Danette Allen’s “Object Persistence and Availability in Digital Libraries” and Steve Lawrence et al.’s “Persistence of Web References in Scientific Research.”16 The former measured, with thrice-weekly checks over slightly more than a year’s time, the persistence and availability of 1,000 objects selected randomly from twenty digital libraries. The latter investigated the continued accuracy of 67,577 URLs cited in research papers using NEC Research Institute’s scientific digital library ResearchIndex (formerly CiteSeer). The results of the studies mentioned above will be integrated later into this paper through analysis and comparison with our own findings.

Methods and Definitions of Terms

With the assistance of a student worker, we manually attempted to access the 2,729 URLs of the http-based resources (gopher and ftp sites, listservs, and e-mail addresses were ignored) listed in the C&RL News web bibliographies published over the eight-year period from 1994 to 2001, using Microsoft’s Internet Explorer 6.0 with the recommended security settings. The URLs were first checked as a whole in mid-June 2002 so that, for the purposes of our later discussion, each year’s bibliographies would be, as a group, an average of an even calendar year old. Six weeks later, we also performed two follow-up examinations involving two separate portions of the URL lists, which will be detailed at the end of this section. In compiling and numbering the lists, URLs with obvious typographical errors were corrected when caught, and duplicate URLs were removed when caught. Also, a few URLs were missed when the authors were first numbering the addresses for inclusion, and while most of these missed URLs were later added to the lists, some may still have been missed.
If there are discrepancies in the number of URLs recorded in this and in Taylor and Hudson’s study of portions of the 1997 and 1998 C&RL News web bibliographies, they may largely be attributed to these causes. In examining the URL lists, our first intent was to review them for their usefulness—in terms of their incidences of success and failure, their apparent “half-lives,” and their inferable “rates of decay” and/or “decay curves”17—from the perspective of the casual user. In our first examination we tried to determine whether the URLs listed were valid, without inquiring into whether the listed sites and pages were still extant elsewhere and could be located with some effort. Our first step was to determine which addresses were “live,” which provided a “re-route,” and which were “dead.” For our purposes, a “live” URL is one that returns the intended site or page as annotated in the C&RL News web bibliographies, a page that redirects the user to a subscription-free registration or login page that subsequently automatically delivers the desired page, or a page that directs the user to a subscription page if the annotation indicates that such a redirection is to be expected. A “re-route” URL is one that either results in one’s being taken automatically to the intended site’s or page’s new and currently correct address or that calls up a page that provides said new and currently correct address.
A “dead” URL is one that returns a “404 Not Found,” “403 Forbidden,” or other such error message, that returns a persistent domain name server (DNS) error message, or that fails to meet the criteria for either live or re-route URLs above (e.g., a URL that returns a site or page that is active but that is not the one described, a URL that requires a subscription when such is not indicated by its annotation, which, we recognize, may be an annotator’s error, and so forth). URLs that returned DNS errors were rechecked twice, once during the following morning when university web traffic was low and then again three days later. If the error persisted or a wrong page was eventually returned, the site or page was recorded as being dead; if the correct site or page or an accurate re-routing page was eventually returned, the URL’s status was recorded as live or re-route, respectively.18 Our second purpose was to discover whether the lists prepared by specialist librarians for C&RL News were superior in their staying power to those assembled randomly and to those prepared by others. To this end, the incidences of failure from several comparable lists will be presented throughout our discussion. Our third aim was to investigate whether some of the characteristics of the URLs in question had any bearing on their status as live, re-route, or dead. To accomplish this, we disaggregated our lists by three criteria. First, we checked the URLs’ top-level domain types (e.g., “com,” “edu,” etc.) and grouped them into one of five types (“TLD type”); second, we assigned the bibliographies one of four broad topical headings and grouped them accordingly (“Topic”); and, third, we recorded the URLs’ server-level domain addresses and grouped them into one of three types (“Server Domain Level” or “SDL”).
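The live/re-route/dead taxonomy defined above, together with the DNS re-check rule, can be restated compactly in code. The sketch below is purely illustrative: the study's checks were performed manually in Internet Explorer, and the function names and simplified boolean inputs are our own assumptions, not the authors' procedure.

```python
# Illustrative sketch of the study's status taxonomy. The inputs are
# simplified stand-ins for what a human checker would actually observe.

def classify_url(status_code, redirected, matches_annotation):
    """Return 'live', 're-route', or 'dead' per the definitions above.

    status_code        -- HTTP status returned for the request
    redirected         -- True if the user is taken (or pointed) to a new address
    matches_annotation -- True if the page finally returned is the one the
                          C&RL News annotation describes
    """
    if status_code in (403, 404) or status_code >= 500:
        return "dead"                  # "404 Not Found", "403 Forbidden", etc.
    if matches_annotation:
        # a supplied new-and-correct address counts as a re-route;
        # a direct (or annotation-expected) delivery counts as live
        return "re-route" if redirected else "live"
    return "dead"                      # active but wrong page, unexpected paywall, ...

def settle_dns_error(recheck_results):
    """A DNS error only counts as dead if it persists: the study re-checked
    the next morning and again three days later before recording a status."""
    for result in recheck_results:     # statuses from the successive re-checks
        if result != "dns_error":
            return result              # first definite result decides the status
    return "dead"
```

The second function mirrors the study's two-stage re-check: the first non-error result (live or re-route) overrides the initial DNS failure; persistent errors are recorded as dead.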
The assigning of URLs to TLD types was largely straightforward for four of the five types: URLs were noted as ending in “.gov,” “.com,” “.edu,” and “.org,” as appropriate. For the fifth grouping, all other types—“.mil,” “.net,” numerical addresses, addresses ending in country-specific TLDs, and so forth—were placed together under the heading “Other,” as the occurrences of each were dwarfed by the occurrences of the other four types. We gave some consideration to reclassifying country-specific two-letter tags that were identifiable as belonging to one of the other types, such as “co.uk” and “net.de,” and to reclassifying those tags whose membership in another type could be inferred, such as McGill University’s homepage’s address (http://www.mcgill.ca/), but decided against doing so to avoid inconsistencies.19 The assigning of topical headings to the bibliographies was also rather straightforward. Each bibliography, upon the basis of its subject, was classified under one of four headings: “Sciences,” “Social Sciences,” “Arts and Humanities,” and “Library and Information Science.” The last topic was used to cover not only library and information science web bibliographies but also several general, ready-reference web bibliographies that the authors could not justifiably assign to one of the other three groups. Lastly, we grouped the URLs into three types by the third variable whose impact we decided to investigate, server domain level. SDL “Type A” comprises “zero-level” addresses, those with no directory structure (http://aaa.bbb.ccc/), and those addresses with sub-files that returned exactly the same page as their “zero-level” addresses (these addresses usually had just one or two levels and ended in “.../index.html,” “.../default.htm,” and so forth). Addresses assigned to “Type C” were those that specified a port number (e.g., http://aaa.bbb.ccc:8080), including the standard port.
We considered reclassifying URLs that specified the standard port, but we encountered a few sites that somehow were dead when the port number was included and live without it, and so decided against reclassification. SDL “Type B” comprises all URLs with sub-files and/or file names that did not meet the criteria for Type A and Type C (e.g., http://aaa.bbb.ccc/xxx/ or http://aaa.bbb.ccc/xxx/page.html). As mentioned above, our final step was to perform two follow-up checks on subsets of the original list six weeks after the initial data gathering. For the first follow-up, we re-checked all of the links we had tagged as dead to determine whether website and -page intermittency had skewed our initial data collection and to some extent invalidated our findings. Taylor and Hudson, after an interval of six months, had performed a similar follow-up with just the URLs that they had recorded as being live, to establish a rate of failure over time; Benbow and Kitchens and Mosley performed no follow-ups; O’Daniel and Kok Wai checked all of their sites twice, with the checks being separated by a six-month interval; and Markwell and Brooks, much like Koehler, checked the whole of their lists at regular intervals.20 In our follow-up, URLs that returned a once-dead page or that provided an accurate re-route were classified as “undead”; URLs that did not were deemed to be still dead. For the second follow-up, we used MetaCrawler, Dogpile, Google, AltaVista, and HotBot to search for sites and pages from the 1998 bibliographies that were still dead after the performance of the first follow-up.21 The 1998 bibliographies were chosen because, at more than three years old, they were beyond the pale for most of the consulted studies. Searches were performed using just the information provided in the C&RL News web bibliographies’ annotations.
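The two mechanical groupings, TLD type and SDL type, can be expressed as simple rules. The sketch below is an approximation under stated assumptions: Type A membership was actually decided by comparing returned pages, which a name-based heuristic (bare paths or index/default filenames) can only approximate, and the function names are ours.

```python
from urllib.parse import urlparse

def tld_type(url):
    """Group a URL into one of the five TLD types used in the study."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    # country-specific tags such as "ca" or "uk" deliberately fall into
    # "Other", mirroring the study's decision not to reclassify them
    return tld if tld in ("gov", "com", "edu", "org") else "other"

def sdl_type(url):
    """Group a URL by server domain level: Type A (zero-level),
    Type B (sub-files), Type C (explicit port number)."""
    parts = urlparse(url)
    if parts.port is not None:
        return "C"          # port specified, even the standard port
    path = parts.path.lower()
    # heuristic stand-in for "returns the same page as the zero-level address"
    if path in ("", "/") or path.endswith(("/index.html", "/default.htm")):
        return "A"
    return "B"
```

For instance, the McGill example from the text (http://www.mcgill.ca/) falls into "other" under these rules, exactly as the study's no-reclassification policy dictates.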
Sites and pages were searched for with several engines, and if a site or page as described did not appear in the first 50 results returned by the various engines, it was deemed to be “Lost.” Sites and pages that were returned were classified as “Found.” While additional searching at the higher domain level of the original websites might have yielded additional “found” websites, we chose instead to search multiple search engines to widen results and to stay with the original plan of searching solely with C&RL News information.

Findings and Results

For the purposes of our discussion, the data and results for the 1994 bibliographies, though included and presented with the later years’, will be largely ignored. The 1994 bibliographies, which contained a scant total of 24 URLs, provide less than 10 percent of the average number of URLs provided by the later years’ bibliographies and are far less varied in their character. Rather than allow an odd year to skew the results, we have chosen simply to present it alongside the others largely without comment.

I. Persistence and Failure of URLs, By Year

The initial analysis focused on the validity of the C&RL News web bibliographies. As Graph I shows, the C&RL News web bibliographies hold up rather well. Our analysis indicated that the strict half-life for these guides, which takes only live URLs into account, is somewhere between 4 and 5 years. Were one to include the semi-valid re-route links as well, the soft half-life of the web bibliographies would appear to be just over 6 years. Koehler, whose sample was randomly selected between December 1996 and January 1998, reported a half-life of 2.9 years for his sites (and he had lost 66.6 percent of his sample after four years),22 as compared to 34 percent dead at the four-year mark in this study.
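A "strict" half-life of this kind can be read directly off yearly survival fractions, while single-observation studies implicitly assume a smooth decay curve instead. A minimal sketch of both readings, with our own function names and hypothetical fractions (they are not the study's full data series):

```python
import math

def observed_half_life(survival_by_age):
    """Interpolate the age at which the live fraction crosses 0.5.
    `survival_by_age` maps age in years -> fraction of URLs still live."""
    pts = sorted(survival_by_age.items())
    for (t0, s0), (t1, s1) in zip(pts, pts[1:]):
        if s0 >= 0.5 >= s1:
            if s0 == s1:
                return t0
            return t0 + (s0 - 0.5) * (t1 - t0) / (s0 - s1)
    return None  # the live fraction never crossed 0.5 in the observed range

def exponential_half_life(age_years, surviving_fraction):
    """Half-life under an assumed smooth exponential decay S(t) = exp(-kt),
    given a single observation (age, fraction still live)."""
    k = -math.log(surviving_fraction) / age_years
    return math.log(2) / k

# hypothetical fractions: 0.93 and 0.66 echo the one- and four-year figures
# discussed in the surrounding text; 0.45 at five years is invented
half_life = observed_half_life({1: 0.93, 4: 0.66, 5: 0.45})  # between 4 and 5
```

The gap between the two readings is the point of the sketch: an exponential fitted to a single early observation can suggest a very different half-life than the one actually observed later, which is one reason cross-study half-life comparisons should be treated cautiously.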
Among the researchers working with consciously assembled lists, Benbow also reported a half-life of 3 years for the list of URLs that she selected in mid-1995.23 Markwell and Brooks, however, who selected their URLs in August of 2000, five years after Benbow collected hers, posited a half-life of 55 months,24 which is comparable to our initial findings. Our incidences of failure for the one- and two-year-old bibliographies also compare favorably to most of the other studies’ findings (even to those of Taylor and Hudson, who also examined C&RL News web bibliographies). We found that 7 percent of URLs (24) were dead at an average of one year after publication and that 17 percent (73) were dead at two. Markwell and Brooks found 16.5 percent of their links to be non-viable after 13 months.25 Taylor and Hudson found 22.2 percent to be outdated at an average of one year after publication for their group.26 Both Benbow and Kitchens and Mosley found nearly 30 percent of their URLs to be dead after two years.27 Interestingly, O’Daniel and Kok Wai found just 2.7 percent of their links to be dead 6 months after collection, and Markwell and Brooks later reported 18.6 percent of their links to be dead after 19 months, a change of just 2.1 percent over the 6 months following a rate of decay of 16.5 percent over 13 months.28 Both of the latter studies used URLs collected near the year 2000, as were the last two years’ worth of URLs in our study, and these lower incidences of failure could suggest that the web, though still very unstable, may be becoming a more stable environment, at least insofar as “selected” URLs are concerned.

[Graph I. Values within parentheses ( ) indicate the URL total for the year.]
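These cumulative percentages are not directly comparable across studies because the observation windows differ; one rough way to normalize them is to convert each to an implied per-month failure rate. The sketch below assumes a constant month-to-month hazard, which the figures above show is only an approximation (Markwell and Brooks's decay slowed markedly in their later months):

```python
def monthly_failure_rate(dead_fraction, months):
    """Per-month failure rate implied by a cumulative fraction of dead
    links, assuming a constant month-to-month hazard (a simplification)."""
    return 1 - (1 - dead_fraction) ** (1 / months)

# Markwell and Brooks: 16.5 percent dead after 13 months
mb_rate = monthly_failure_rate(0.165, 13)   # roughly 1.4 percent per month

# O'Daniel and Kok Wai: 2.7 percent dead after 6 months
ok_rate = monthly_failure_rate(0.027, 6)    # roughly 0.5 percent per month
```

On this crude normalization, the post-2000 collections do indeed imply noticeably lower per-month rates than the mid-1990s lists, which is consistent with the stabilizing-web reading suggested above.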
These results also suggest, when one compares the incidences of failure in the randomly selected and consciously assembled topical lists over time, that URL selection is very much a value-adding service, and perhaps even more so when the lists are assembled by specialist librarians. Obviously, as a result, findings from examinations of consciously assembled topical lists of URLs should not be used to discuss the behavior of the World Wide Web in general. Surprisingly, Lawrence et al. and Nelson and Allen, whose studies were concerned with URLs and objects from digital libraries, found similar incidences of invalid URLs in their samples. Lawrence et al. found that 23 percent of URLs in their study were invalid after one year and that 54 percent were invalid after six; and Nelson and Allen found that 3.1 percent of objects were no longer available at the end of their 161-day testing period,29 a rate similar to O’Daniel and Kok Wai’s. Their results are a bit surprising, as one would assume that objects’ “being placed in a digital library is indicative of someone’s desire to increase the persistence and availability of an object.”30 It would seem, however, in terms of the percentage of dead URLs, that a managed collection of web-available resources may not perform much better than a collection of URLs that has been carefully assembled by a specialist librarian or other knowledgeable professional.

II. By Top Level Domain Type, By Year

Our second step was to examine our results by top-level domain type to determine whether the sites’ and pages’ types had any bearing on the viability of the listed URLs over time. The C&RL News web bibliographies tend to include URLs from TLD types .com (702) and .edu (715) considerably more often than they include URLs from the TLD types .org (497) and the types in the Other (496) category.
They also include .com and .edu more than twice as often as they include .gov URLs (322), which is ironic in that, as Graph II reveals, with the exception of the URLs in the Other category, TLD types .com and .edu are fairly consistently the worst performers in terms of percentage of dead URLs, while .gov URLs are usually the best. Interestingly, in addition to having a high percentage of live URLs, .gov URLs also had comparably high re-route percentages for the 1999–2001 bibliographies.


Similar resources

Comparative Study of Methodological Approach to Models and studies of Information Seeking Behavior of Iranian Researchers

Background and Aim: The aim of this article is to revise the methodological status of some of the most prominent studies and models in the field of Information Seeking Behavior in order to provide Iranian researchers with a brief comparative perception of the field. Method: A literature review approach is applied to identify the research methods and historical origins of each study related to t...


Needed: Global Collaboration for Comparative Research on Cities and Health

Over half of the world’s population lives in cities and United Nations (UN) demographers project an increase of 2.5 billion more urban dwellers by 2050. Yet there is too little systematic comparative research on the practice of urban health policy and management (HPAM), particularly in the megacities of middle-income and developing nations. We make a case for creating a global database on citie...


A Meta-Analysis of the Factors Affecting Drug Trafficking and Strategies to Deal with it during 2006 to 2017 in Tehran

Objective: The present study aimed to conduct a meta-analysis of factors affecting drug trafficking and strategies to deal with it from 2006 to 2017 in Tehran. Method: The research method was meta-analysis. A total of 18 studies in the field of drug trafficking were identified through searching on the Internet in domestic databases and in organizations and universities in Tehran, and finally, d...


Open Source Integrated Library Management Systems: Comparative Analysis of Koha and NewGenLib

This paper aims to study the open source integrated library management systems, i.e. Koha and NewGenLib, to inform librarians about what considerations to make when choosing an open source integrated library management system (ILMS) for their library. The paper provides a detailed comparative analysis of both types of software, i.e. Koha and NewGenLib, which are undertaken in the study. Both typ...


Methodological considerations in observational comparative effectiveness research for implantable medical devices: an epidemiologic perspective.

Medical devices play a vital role in diagnosing, treating, and preventing diseases and are an integral part of the health-care system. Many devices, including implantable medical devices, enter the market through a regulatory pathway that was not designed to assure safety and effectiveness. Several recent studies and high-profile device recalls have demonstrated the need for well-designed, vali...


The Qualitative Descriptive Approach in International Comparative Studies: Using Online Qualitative Surveys

International comparative studies constitute a highly valuable contribution to public policy research. Analysing different policy designs offers not only a mean of knowing the phenomenon itself but also gives us insightful clues on how to improve existing practices. Although much of the work carried out in this realm relies on quantitative appraisal of the data contained in international databa...



Journal:

Volume   Issue

Pages  -

Publication date 2017